Data Science for Biologists
Associate Professor in Data Science and Genetics at the University of East Anglia.
Academic background in Behavioural Ecology, Genetics, and Insect Pest Control.
Teach Genetics, Programming, and Statistics
One workshop per week
One lecture per week
One assignment per week
One ‘capstone’ project
I hope you end up with more questions than answers!
For research to be reproducible, both data and methods should be available.
Applying the described methods to the data leads to the same results
In theory, method availability ≠ code
But with complex data and analyses, are methods of data collection enough?
Science advances incrementally by identifying and rectifying errors over time
Peer review: critical evaluation of papers by experts maintains quality
Independent studies either support or fail to replicate findings
Publication bias: preference for positive results
Pressure to publish
Poor study designs and statistical issues
Lack of transparency
The reproducibility crisis emerged when numerous studies, especially in fields like psychology, medicine, and biology, failed to be replicated by other researchers.
High-profile replication attempts revealed that many published results could not be consistently reproduced, raising doubts about their validity.
Recognition that no study should be considered ‘definitive’
Empower lasting systemic change through increased transparency in research methods, data sharing and reporting
Structural change in academic culture
Open science is a global movement that aims to make scientific research and its outcomes freely accessible to everyone. By fostering practices like data sharing and preregistration, open science not only accelerates scientific progress but also strengthens trust in research findings.
UK Reproducibility Network - funded by UK Research Council
46 member institutions (UEA is one)
Establish open research practices across UK research
/home/phil/Documents/paper
├── abstract.R
├── correlation.png
├── data.csv
├── data2.csv
├── fig1.png
├── figure 2 (copy).png
├── figure.png
├── figure1.png
├── figure10.png
├── partial data.csv
├── script.R
└── script_final.R
README
Documented
Easy to code with
All files are inside the root folder
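A minimal sketch of how such a project could be laid out, assuming a hypothetical project called `my-project` and illustrative folder names (`data/raw`, `scripts` and `analysis` echo the example paths used in this deck; `data/processed` and `figures` are assumptions). It builds the skeleton in a temporary directory, so nothing on your own disk is changed.

```r
# Hypothetical project skeleton; folder names are illustrative, not prescribed.
project_root <- file.path(tempdir(), "my-project")

folders <- c("data/raw", "data/processed", "scripts", "analysis", "figures")

# Create every folder inside the single root folder.
for (folder in folders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# A README at the top level documents what the project is and how to run it.
writeLines(
  c("# my-project",
    "",
    "Raw data live in data/raw and are never edited by hand.",
    "Run the scripts in scripts/ in numbered order."),
  file.path(project_root, "README.md")
)

# List what was created, relative to the project root.
created <- list.files(project_root, recursive = TRUE, include.dirs = TRUE)
print(created)
```

Because everything sits under one root, the whole folder can be moved, zipped or shared without breaking any file paths.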
What do you think are the contents of these files:
data/raw/madrid_minimum-temperature.csv
scripts/02_compute_mean-temperature.R
analysis/01_madrid_minimum-temperature_descriptive-statistics.qmd
Come up with good names for these:
a dataset of cats with columns for weight, length, tail length, fur colour(s), fur type and name.
a script that downloads data from Spotify.
a script that cleans up data.
a script that fits a linear discriminant model and saves it to a file.
Use projects
Check your code runs on blank slates
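One way to check that a script runs on a blank slate is to launch it in a brand-new R process. This is a sketch under stated assumptions: the script name `01_compute_mean.R` and the values in it are made up, and the demo project lives in a temporary directory.

```r
# Write a tiny, hypothetical analysis script to a temporary project folder.
project_root <- file.path(tempdir(), "blank-slate-demo")
dir.create(project_root, showWarnings = FALSE)

script_path <- file.path(project_root, "01_compute_mean.R")
writeLines(
  c("temperatures <- c(2.1, 3.4, 1.8)  # made-up values for illustration",
    "result <- mean(temperatures)",
    "writeLines(format(result), 'mean.txt')"),
  script_path
)

# Run the script in a fresh R process. --vanilla skips saved workspaces
# and user profiles, so the script can only rely on what it creates itself.
rscript <- file.path(R.home("bin"), "Rscript")
old_wd <- setwd(project_root)
exit_status <- system2(rscript, c("--vanilla", shQuote(basename(script_path))))
setwd(old_wd)

# An exit status of 0 means the script ran start to finish on a blank slate.
print(exit_status)
```

The script uses a relative path (`mean.txt`), so it works wherever the project folder is placed, as long as it is run from the project root.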
Automates the creation of a paper or report
Saves time
Reduces errors
(https://www.nature.com/articles/d41586-022-00563-z)
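The idea behind automated reports can be sketched in base R: numbers in the text are computed from the data rather than typed by hand. The measurements and variable names below are invented for illustration. In practice a literate programming tool such as Quarto (the `.qmd` files mentioned earlier) would do this inside the document itself.

```r
# Made-up measurements, used only to illustrate the idea.
wing_length_mm <- c(4.2, 4.8, 5.1, 4.6, 4.9)

# Compute the summary statistics once, in code.
n <- length(wing_length_mm)
mean_length <- mean(wing_length_mm)
sd_length <- sd(wing_length_mm)

# Build the results sentence from the computed values, so the text
# updates automatically whenever the data change.
results_sentence <- sprintf(
  "Mean wing length was %.2f mm (SD = %.2f, n = %d).",
  mean_length, sd_length, n
)
print(results_sentence)
```

If a data point is corrected or added, rerunning the code regenerates the sentence, which removes a common source of copy-and-paste errors.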
Discovering Statistics - Andy Field
An Introduction to Generalized Linear Models - Dobson & Barnett
An Introduction to Statistical Learning with Applications in R - James, Witten, Hastie & Tibshirani
Mixed Effects Models and Extensions in Ecology with R - Zuur, et al.
Ecological Statistics: Contemporary Theory and Application - Fox, Negrete-Yankelevich & Sosa
The Big Book of R (https://www.bigbookofr.com/)
Writing statistical methods for ecologists
Reporting statistical methods and outcome of statistical analyses in research articles
Design principles for data analysis
Log-transformation and its implications for data analysis
Effect size, confidence interval and statistical significance: a practical guide for biologists
Misconceptions, Misuses, and Misinterpretations of P Values and Significance Testing
Ten common statistical mistakes to watch out for when writing or reviewing a manuscript
Why most published research findings are false
Model averaging and muddled multimodel inference
A brief introduction to mixed effects modelling and multi-model inference in ecology
The Practical Alternative to the p Value Is the Correctly Used p Value